Mouse Cerebral Cortex Analysis

Scott the Scientist did an analysis of RNA-Seq data from the development of the mouse cortex at days -8, -4, 0 , 1, 7, 16, 21, and 26 taken from the Allen Brain Atlas Developing Mouse Portal http://developingmouse.brain-map.org/. This type of data is know as time series data since the features are taken through time.

Preparation of the Mouse Homologs Data

Scott begins by reading in the data and preparing the data frame. The columns are days at which the samples are collected where DayNegN, Day0, and DayPosN means N days before birth, day of birth, and N days after bith. The entries in the columns are the amount of RNA for each gene detected on that day in the mouse embryo cerebral cortex. We then change the columns’ name to their actual meaning and create a matrix for future analysis.

# read data as data frame
Mouse.df <-read.csv("datasets/MouseHomologData.csv", row.names = 1) 
head(Mouse.df)  # preview of the data frame
##            DayNeg8     DayNeg4        Day0     DayPos1    DayPos7    DayPos16
## A1BG   -0.54006172  1.62018517  1.62018517 -0.54006172 -0.5400617 -0.54006172
## A4GALT  0.77955207 -0.01579633  1.82626020  0.64704748 -0.6928331 -0.84321867
## AAAS    2.02780897  0.97337784  0.06271547 -0.47846790 -0.4818574 -0.70894012
## AADAT   0.05395975 -1.45584988  0.62811270 -0.77537230  1.9295261  0.05395975
## AAMP   -0.58236089 -1.82323478  0.59797889  1.49446830  0.1356618 -0.56373317
## AANAT  -0.44728620 -0.64756360 -0.11349053 -0.04673139  2.3966529 -0.04673139
##           DayPos21   DayPos28
## A1BG   -0.54006172 -0.5400617
## A4GALT -0.86458745 -0.8364241
## AAAS   -0.71097391 -0.6836630
## AADAT  -0.04173241 -0.3926037
## AAMP    0.11103932  0.6301806
## AANAT  -0.64756360 -0.4472862
# change column name
colnames(Mouse.df)<-c("-8", "-4", "0", "1", "7", "16", "21", "28")
head(Mouse.df)  # preview of the data frame
##                 -8          -4           0           1          7          16
## A1BG   -0.54006172  1.62018517  1.62018517 -0.54006172 -0.5400617 -0.54006172
## A4GALT  0.77955207 -0.01579633  1.82626020  0.64704748 -0.6928331 -0.84321867
## AAAS    2.02780897  0.97337784  0.06271547 -0.47846790 -0.4818574 -0.70894012
## AADAT   0.05395975 -1.45584988  0.62811270 -0.77537230  1.9295261  0.05395975
## AAMP   -0.58236089 -1.82323478  0.59797889  1.49446830  0.1356618 -0.56373317
## AANAT  -0.44728620 -0.64756360 -0.11349053 -0.04673139  2.3966529 -0.04673139
##                 21         28
## A1BG   -0.54006172 -0.5400617
## A4GALT -0.86458745 -0.8364241
## AAAS   -0.71097391 -0.6836630
## AADAT  -0.04173241 -0.3926037
## AAMP    0.11103932  0.6301806
## AANAT  -0.64756360 -0.4472862
# create matrix
Mouse.matrix <- as.matrix(Mouse.df)

Note that the data already been scaled so that the analysis can focus on the shape of the time series rather than specific magnitudes. We can confirm the scaling is successful by calculating the row means and making sure their norm was near 0.

norm(rowMeans(Mouse.matrix))
## [1] 6.127969e-08

Cluster by K-means Clustering

We used K-means clustering to create five clusters based on domain knowledge: biologists believe there are five stages of brain development, so we select five clusters.

set.seed(300)
km <-kmeans(Mouse.matrix, 5)

Examining the heatmap of the K-means cluster centers, we can see that each cluster corresponds to different average peaks of gene expressions.

heatmap.2(
  x = km$centers, Colv = FALSE,
  dendrogram = "none", trace ="none",
  main = "K-means Cluster Centers",
)

We can also see the time trends in the clusters means by plotting each cluster mean as a line. The cluster means have to be reformatted into a data frame with columns Cluster, Day and Mean. This is done using the dplyr package gather() command.

# cluster means
tics <- c(-8,-4,0,1,7,16,21,28) # x-axis "tics" (for the plot)
clustermean.df <- as.data.frame(km$centers, row.names = c("1","2","3","4","5"))
clustermean.df
##           -8         -4          0          1           7         16         21
## 1 -0.4496261 -0.6106791  1.3974736  1.1913655  0.03916353 -0.5570030 -0.5703216
## 2  1.9712955  0.4686750 -0.1298583 -0.4293038 -0.27156219 -0.5828320 -0.5422287
## 3 -0.7822509 -1.0207248 -0.6142838 -0.5672292  0.28858510  0.6305905  0.8961737
## 4  1.0614302  0.5152264  1.1091815  0.2865445 -0.28002391 -0.9195562 -0.9137313
## 5 -0.9336692 -1.1442140 -0.1620285  0.3392103  1.44873958  0.3088564  0.1023566
##            28
## 1 -0.44037268
## 2 -0.48418548
## 3  1.16913936
## 4 -0.85907127
## 5  0.04074883
# reformatted
clustermean.df <- clustermean.df %>% 
  rownames_to_column("Cluster") %>% #  Make a new column called Cluster
  gather(key="Day",value="Mean", -Cluster) %>%
  convert(int(Day))  # convert Day to an integer
head(clustermean.df)
## # A tibble: 6 × 3
##   Cluster   Day   Mean
##   <chr>   <int>  <dbl>
## 1 1          -8 -0.450
## 2 2          -8  1.97 
## 3 3          -8 -0.782
## 4 4          -8  1.06 
## 5 5          -8 -0.934
## 6 1          -4 -0.611

We use the reformatted data frame and ggplot with geom_line() to make the plots. Note our use of facet_wrap() to generate individual plots by Cluster.

ggplot(clustermean.df,aes(x=Day, y=Mean, col=Cluster)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=tics) +
  labs(title="Cluster Centers") +
  facet_wrap(Cluster ~.)  # Use facet_wrap to make a separate plot for each cluster

Here we make separate heat map for each of the five clusters. Plot the cluster heatmaps in the order they occur in development. Note that all the heatmaps are on the same scales and that they are cluster on the genes but not the days.

heatmap.2(Mouse.matrix[km$cluster==1,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster 1",
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==2,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse GenesCluster 2",
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==3,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster 3",
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==4,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster 4",
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==5,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster 5",
          labRow= NA,
          trace ="none")

PCA Analysis and Visualization by biplot

We visualize the cluster using a biplot of two components generated by PCA which explain 74% of the variance.

# Calculate the PCA
my.pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)
# Summarize, to see the complete PCA result
summary(my.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0364 1.3306 0.9370 0.66801 0.59296 0.54132 0.33712
## Proportion of Variance 0.5184 0.2213 0.1098 0.05578 0.04395 0.03663 0.01421
## Cumulative Proportion  0.5184 0.7397 0.8494 0.90521 0.94917 0.98579 1.00000
##                              PC8
## Standard deviation     3.306e-11
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

We display the points in a “biplot.” The projection of the points makes an interesting disc-type shape which is less dense in the middle, like a donut.

We can see that the clusters are arranged in time order around the disc. Why do we see this?

# Calculate x and y scale limits for the biplot
t<-1.2*max(abs(my.pca$x[,1:2]))

# Generate the biplot using ggbiplot
p <- ggbiplot(my.pca,
            choices=c(1,2),  # Use PC1, PC2
            alpha=.1,        # Make dots transparent
            varname.adjust=1.5,  # Move variables names out a bit
            scale =0,       # Don't rescale data
            groups=as.factor(km$cluster)) +
     ggtitle('Mouse Biplot for PC1 and PC2') + xlim(-t,t) + ylim(-t,t) # title plot and make square

p

\(\color{blue}{Exercise~3}\)

We hypothesize that each cluster of genes represents one of the five “stages” of brain development labeled A,B,C,D,E. Assign each cluster to its corresponding stage using the biplot. For example, Cluster 5 represents Stage D of development since it peaks at the first at the point Day7.

  • Stage A = Cluster ?
  • Stage B = Cluster ?
  • Stage C = Cluster ?
  • Stage D = Cluster 5
  • Stage E = Cluster ?

\(\color{blue}{Exercise~4}\)

Examine the scalar projections of the coordinate axes in the biplot, for Days -8, -4, 0, 1, 7, 16, 21 and 28. Notice that the coordinate vectors act as hours on a “developmental time clock” that starts at Day -8. Does time on this development clock run clockwise or counterclockwise?

You’ve now completed in class Prelab5! Go to LMS and complete the online quiz.

0. Introduction

Using mouse gene and cluster technique, we find five stages for the mouse gene. Then, by the known human gene in some diseases, we intersect with mouse gene and stage we just find to find out which stage is enriched. Then, that stage is the most susceptible stage for the mouse of that disease. Then, we compare to the result of human that we already know.

1. Stages of Development Analysis

From the prelab 5, we have that the order of the cluster is 2, 4, 1, 5, 3. Thus, this time, after k-means, we set cluster 2, 4, 1, 5, 3 as A, B, C, D, E in order.

# get sample data
Mouse.df <- read.csv("datasets/MouseHomologData.csv", row.names = 1) 
colnames(Mouse.df)<-c("-8","-4","0","1","7","16","21","28")
Mouse.matrix <- as.matrix(Mouse.df)

# k-means
set.seed(300)
km <-kmeans(Mouse.matrix, 5)
rownames(km$centers) <- c('C', 'A', 'E', 'B', 'D')

Then, we do a PCA and provide the biplot using PC1 and PC2, colored according to the k-means cluster result.

# PCA
my.pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)

a.matrix <- km$cluster

# biplot using ggbiplot
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
p <- ggbiplot(my.pca,
            choices = c(1,2),
            alpha = 0.1,
            varname.adjust = 1.5,  # Move variables names out a bit
            scale = 0,
            groups=label) +
     ggtitle('Mouse Biplot for PC1 and PC2')
p

Next, we create a heat map that visualizes the result, with the order from cluster ‘A’ to ‘E.’ It’s easy to see a shift of higher value from cluster ‘A’ to ‘E.’

set.seed(300)
heatmap.2(km$centers, 
          scale = "none",
          dendrogram = "none",
          Colv=FALSE,
          cexCol=1.0,
          main = "Kmeans Cluster Centers",
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==2,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster A",
          cexCol=1.5,
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==4,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse GenesCluster B",
          cexCol=1.5,
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==1,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster C",
          cexCol=1.5,
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==5,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster D",
          cexCol=1.5,
          labRow= NA,
          trace ="none")

heatmap.2(Mouse.matrix[km$cluster==3,],
          scale = "none",
          dendrogram = "row",
          Colv=FALSE,
          main = "Mouse Genes Cluster E",
          cexCol=1.5,
          labRow= NA,
          trace ="none")

2. Windows of Susceptibility Analysis of Zika

First we get the data point of Zika, and then make the biplot using these point. The PCA and cluster of data is from the mouse data

# get sample data
disease.df <- read.csv("datasets/Zikamicrocephaly_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))

label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(my.pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]

# biplot
ggplot()+
  geom_point(data = myplot.df, 
             aes(x=PC1, y=PC2, color=cluster) )+
  geom_segment(x=0, y=0, 
               aes(xend=3*my.pca$rotation[,1], yend=3*my.pca$rotation[,2]),
               arrow=arrow(length=unit(1/2,'picas')) )+ 
  geom_text(aes(x=3.3*my.pca$rotation[,1], 
                y=3.3*my.pca$rotation[,2], 
                label=c("-8","-4","0","1","7","16","21","28")), 
            size=4)

Def “cluster_pvals” to calculate pvalue and logodds of each cluster. The code come from the Lab5 handout from professor.

# Define cluster_pvals; DO NOT CHANGE!
cluster_pvals <- function(k, km, myplot.df) {
  # Inputs: k, km, myplot.df 
  # Returns: results (dataframe with clusters, pvalues, logodds)
  # Set the p-value and logodds to 0
  pvalue <- zeros(k,1)
  logodds <- zeros(k,1)
  results <- cbind.data.frame(cluster=1:k, pvalue, logodds)
  classdisease <- zeros(k,1)
  classall <- as.vector(table(km$cluster))
  # use dplyr to calculate counts for each cluster 
  temp <- myplot.df %>% 
    dplyr::group_by(cluster) %>% 
    dplyr::count(name="freq")  # Creates 'freq' column!
  classdisease[temp$cluster] <- temp$freq
  classlogodds <- zeros(k,2)
  totaldisease <- sum(classdisease)
  totalall <- sum(classall)
  # Calculate the log odds ratio for the disease
  for (i in 1:k) {
    n11 <- classdisease[i] +1                # genes in disease in cluster i 
    n21 <- totaldisease- classdisease[i] +1  # genes in disease not in cluster i
    n12 <- classall[i]-n11+1                 # genes not in disease and in cluster i
    n22 <- totalall- n11-n21 -n12+1;         # genes not in disease and not in cluster 
    res <- fisher.test(matrix(c(n11,n21,n12,n22), 2, 2))
    results[i,]$pvalue <- res$p.value
    results[i,]$logodds<- log((n11*n22)/(n12*n21))
  }
  return(results)
}

Next we use the help function “cluster_pvals” to calculte the log odds ratio of Zika disease for each cluster.

# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)

# Helper function to determine enrichment
threshold <- 0.1  # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
  if ((p.value <= p.threshold) && (logodds > 0)) {
    return(TRUE)
  } 
  else {
    return(FALSE)
  }
}

# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)

# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"

# View results
kable(clusters)
cluster pvalue logodds enriched
C 0.0026364 -1.0289834 FALSE
A 0.6139669 -0.1544676 FALSE
E 0.0958240 -0.5265662 FALSE
B 0.0000000 1.5996365 TRUE
D 0.0400381 -0.7896979 FALSE

From the result, we see the stage B is enriched. Thus, for Zika disease, stage B is the most susceptable period.

3. Windows of Susceptibility Analysis of Cognitive Disorder

Same process as above, first we get the data point of Cognitive Disorder, and then make the biplot using these point. The PCA and cluster of data is from the mouse data

# get sample data
disease.df <- read.csv("datasets/Cognitive disorder_heat_map_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))

label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(my.pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]

# biplot
ggplot()+
  geom_point(data = myplot.df, 
             aes(x=PC1, y=PC2, color=cluster) )+
  geom_segment(x=0, y=0, 
               aes(xend=3*my.pca$rotation[,1], yend=3*my.pca$rotation[,2]),
               arrow=arrow(length=unit(1/2,'picas')) )+ 
  geom_text(aes(x=3.3*my.pca$rotation[,1], 
                y=3.3*my.pca$rotation[,2], 
                label=c("-8","-4","0","1","7","16","21","28")), 
            size=4)

Next we use the help function “cluster_pvals” provides in lab 5 handout and defines above to calculte the log odds ratio of Rett syndrome disease for each cluster.

# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)

# Helper function to determine enrichment
threshold <- 0.1  # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
  if ((p.value <= p.threshold) && (logodds > 0)) {
    return(TRUE)
  } 
  else {
    return(FALSE)
  }
}

# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)

# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"

# View results
kable(clusters)
cluster pvalue logodds enriched
C 0.1679806 -0.1574106 FALSE
A 0.0000000 -0.6771831 FALSE
E 0.0000000 0.7325017 TRUE
B 0.0000031 -0.5988984 FALSE
D 0.0003173 0.3992603 TRUE

The result shows that stages D and E are enriched, which is the most susceptible period for Cognitive Disorder. In human data, the last stage, upper layers, is enriched. Thus, the mice and human results are match: they both enriched at the last stage.

4. My Creative Analysis

Plot the logodds & pvalue v.s. cluster of Cognitive Disorder for both human and mouse model. We observe that for both human and mouse, the logodds is a slight slop and pvalue is a “mountain” shape, which shows that the result are similar between human and mouse.

twoord.plot(lx = c(1, 2, 3, 4, 5),
            ly = c(-0.6771831, -0.5988984, -0.1574106, 0.3992603, 0.7325017), 
            rx = c(1, 2, 3, 4, 5),
            ry = c(0.0000000, 0.0000031, 0.1679806, 0.0003173, 0.0000000),
            xlab="Cluster",
            ylab="Logodds",
            rylab="Pvalue",
            main="Logodds & Pvalue of Cognitive Disorder for mouse model",
            xticklab = c('A','B','C', 'D','E'),
            lylim=c(min(clusters$logodds)-0.1, max(clusters$logodds)+0.1),
            type=c('b','b')
)

twoord.plot(lx = c(1, 2, 3, 4, 5, 6),
            ly = c(-0.4037, -0.3873, 0.1589, 0.1276, -0.02535, 0.4989), 
            rx = c(1, 2, 3, 4, 5, 6),
            ry = c(0.0001, 0.0008, 0.1853, 0.2592, 0.9014, 0),
            xlab="Cluster",
            ylab="Logodds",
            rylab="Pvalue",
            main="Logodds & Pvalue of Cognitive Disorder for human model",
            xticklab = c('Pluripotency','Neuroectoderm','Neural Differentiation', 
                         'Cortical Specification','Deep Layers', 'Upper Layers'),
            lylim=c(min(clusters$logodds)-0.1, max(clusters$logodds)+0.1),
            type=c('b','b')
)

5. Conclusions

Zika virus will cause infants to be born with microcephaly, which is smaller brain size, if mother infected during pregnancy. Stage B is the most susceptable period of Zika.

Any disorder that cognitive function of a person is significantly affected, and normal functioning is impossible without treatment, will be cognitive disorders. Alzheimer’s disease is one of the most common cognitive disorders, and it has affectes approximately 5.1 million Americans. It will causes memeory loss and other cognitive problems. For human, the last stage upper layers is most susceptible to the cognitve disorders, and similar result been found in mouse model as well. For mouse, the stages D and E are most susceptiabel, which is also the last stages.

```